NEON provides data spanning a broad range of ecosystems and phyla. In this report we examine how data are collected at the site Great Basin, Onaqui, Utah, USA, and how data on the phylum Methylomirabilota are collected. We pose a series of questions about these data to better understand the findings. The results include graphs, tables and phylogenetic trees that present the NEON data visually. Using these data and visualizations, we analyze the types of ecosystems in the USA and how the different phyla in these ecosystems affect the sites.
NEON is an observational facility whose purpose is to collect ecological data. NEON uses these data to characterize the ecosystems of the United States and, by following them over time, to see how those ecosystems are changing and what may be changing them. NEON’s long-term motivation is to help keep these ecosystems sustainable, working with its partners and community. From the data NEON has collected, we took the records for Great Basin, Onaqui, Utah, USA and for Methylomirabilota. We transform these data into visualizations so that the findings are easier to understand and more accessible. We can then analyze the tables and graphs and make claims about the specific sites and phyla and how they relate to each other.
NEON stands for the National Ecological Observatory Network. Its purpose is to gather ecological data across different sites and taxonomic ranks to better understand how ecosystems change over time. Specifically, we look at the terrestrial site Great Basin, Onaqui, Utah, USA. For this site, we examine which phyla, classes and families of bacteria are present and how they may affect the ecology of the location. We also look at the phylum Methylomirabilota and analyze its presence in different locations.
Onaqui is a terrestrial field site located about 50 miles southwest of Salt Lake City. Its climate is arid, with little precipitation, hot summers and cold winters. Several soil series are found at the site, including Taylorsflat, Sterling, Sevy, Strevell and many more. Vegetation is found predominantly on the eastern side of the site, at the base of the mountains and up into the woodlands. The fauna includes coyotes, jackrabbits and rattlesnakes, as well as other small mammals and birds. The land is managed by the Bureau of Land Management, which allows many different uses of the site, including data collection, recreation and hunting.
Methylomirabilota is a bacterial phylum, also known as NC10. Its members are known for their biogeochemical impact on the locations in which they are found, although much remains to be discovered about the phylum. Their main known function is the oxidation of methane coupled to denitrification. Described members do this in anoxic environments through an intra-aerobic pathway, in which the cell produces the oxygen it needs internally from nitrite rather than taking it up from its surroundings. The phylum is found in a diverse range of habitats, as highlighted in the results section. Methylomirabilota is important because it contributes to methane regulation in the ecosystem.
NEON used multiple data collection methods to obtain the samples for each site and phylum. For each site, data were collected on weather, climate, land cover and the species within the ecosystem. The three methods used were airborne remote sensing, automated instruments and observational sampling. Airborne remote sensing used spectrometers, digital cameras, lidar, GPS and an inertial measurement unit. The automated instruments were used to collect soil, surface water and groundwater data to examine patterns as well as the bacteria found in these locations. Finally, observations were split into aquatic and terrestrial observations, in which species diversity and environmental or chemical properties could be examined. The data are published on the NEON website, from which we retrieved the data for our site and phylum. These data were then converted to a CSV file so that they could be imported into R Markdown, where they were used to build graphs and tables. This was done by asking site- or phylum-specific questions of the data. The questions asked during this process are listed below.
Site-specific:
Which MAGs are found within the subplots of Onaqui?
What is the taxonomic breakdown at Onaqui?
Are there any novel bacteria found in Onaqui?
What is the correlation of site/ecosystem subtype to soil temperature and soil pH?
Phylum-specific:
Where in the US is the phylum Methylomirabilota found?
Is Methylomirabilota found in our selected site?
Where is each order found?
What are the individual and co-assemblies for Methylomirabilota?
What is the soil temperature for this phylum?
What are the ecosystem subtypes for Methylomirabilota?
Question: Which MAGs are found within the subplots of Onaqui?
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_edArchaea.csv") %>%
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>%
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                                     TRUE ~ "Individual")) %>%
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
  # split the taxonomy lineage into one column per rank
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>%
  # remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # use the first " - " to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date (1): Date Added
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 6 pieces. Additional pieces discarded in 46 rows [3, 4, 24, 25, 26,
## 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 54, 232, 267, ...].
## Warning: Expected 6 pieces. Missing pieces filled with `NA` in 446 rows [1, 2, 9, 10,
## 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 46, 50, 53, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [4, 7, 8, 236,
## 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
## ...].
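The chain of `separate()` calls above is easier to follow on a single value. The sketch below is illustrative only: the two genome names are made up to mimic the naming pattern in the GOLD export, and `fill = "right"` is an optional `tidyr::separate()` argument (not used in the report) that pads missing pieces with `NA` quietly. Rows such as "NEON combined assembly", which contain no " - ", are the kind of value that the "Expected 2 pieces" warnings refer to.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Two made-up rows that mimic the naming pattern (not taken from the real file)
toy <- tibble(
  `Genome Name` = c(
    "Terrestrial soil microbial communities from Onaqui, Utah, USA - ONAQ_004-M-20210525-comp-1",
    "NEON combined assembly"
  )
)

toy_split <- toy %>%
  # remove the common prefix
  mutate(`Genome Name` = str_remove(`Genome Name`, "Terrestrial soil microbial communities from ")) %>%
  # split at " - "; the combined assembly has no second piece, so it gets NA
  separate(`Genome Name`, c("Site", "Sample Name"), sep = " - ", fill = "right") %>%
  # remove the common suffix
  mutate(`Sample Name` = str_remove(`Sample Name`, "-comp-1")) %>%
  # split the sample name into site ID and plot info
  separate(`Sample Name`, c("Site ID", "subplot.layer.date"), sep = "_",
           remove = FALSE, fill = "right") %>%
  # split the plot info into subplot, soil layer and collection date
  separate(subplot.layer.date, c("Subplot", "Layer", "Date"), sep = "-", fill = "right")

print(toy_split)
```

Working through one row this way makes it clear which rows end up with `NA` in `Site ID`, `Subplot`, `Layer` and `Date`, and why.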
## Rows: 176 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, S...
## dbl (4): taxon_oid, IMG Genome ID, Genome Size * assembled, Gene Count * a...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 176 × 12
## taxon_oid Domain `Sequencing Status` `Study Name` Genome Name / Sample…¹
## <dbl> <chr> <chr> <chr> <chr>
## 1 3300069219 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 2 3300069216 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 3 3300062116 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 4 3300060668 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 5 3300060914 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 6 3300069208 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 7 3300067032 *Microbio… Permanent Draft Terrestrial… NEON combined assembly
## 8 3300061641 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 9 3300069224 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## 10 3300069268 *Microbio… Permanent Draft Terrestrial… Terrestrial soil micr…
## # ℹ 166 more rows
## # ℹ abbreviated name: ¹`Genome Name / Sample Name`
## # ℹ 7 more variables: `Sequencing Center` <chr>, `IMG Genome ID` <dbl>,
## # `GOLD Study ID` <chr>, Latitude <chr>, Longitude <chr>,
## # `Genome Size * assembled` <dbl>, `Gene Count * assembled` <dbl>
library(TDbook) #A Companion Package for the Book "Data Integration, Manipulation and Visualization of Phylogenetic Trees" by Guangchuang Yu (2022, ISBN:9781032233574).
library(ggimage)
library(rphylopic)
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_2024_4_21.csv") %>%
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`, `Bin Lineage`)) %>%
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                                     TRUE ~ "Individual")) %>%
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
  # strip the rank prefixes (d__, p__, ...) from the lineage string
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "d__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "p__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "c__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "o__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "f__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "g__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "s__", "") %>%
  # split the lineage into one column per rank
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), ";", remove = FALSE) %>%
  # turn empty rank strings into NA
  mutate_at("Domain", na_if, "") %>%
  mutate_at("Phylum", na_if, "") %>%
  mutate_at("Class", na_if, "") %>%
  mutate_at("Order", na_if, "") %>%
  mutate_at("Family", na_if, "") %>%
  mutate_at("Genus", na_if, "") %>%
  mutate_at("Species", na_if, "") %>%
  # remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # use the first " - " to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date (1): Date Added
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
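The lineage handling above (strip the rank prefixes, split on `;`, turn empty ranks into `NA`) can be sketched more compactly. This is only one possible way to write it, not the report's method: the two lineage strings are made up, the `ranks` vector and the column name `lineage` are illustrative, and the single regular expression replaces the seven `str_replace` calls on the assumption that each prefix appears only at the start of a rank.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Made-up lineage strings in the "d__...;p__...;..." style
toy <- tibble(
  lineage = c(
    "d__Bacteria;p__Methylomirabilota;c__Methylomirabilia;o__;f__;g__;s__",
    "d__Bacteria;p__Actinobacteriota"
  )
)

parsed <- toy %>%
  # drop a rank prefix only at the start of the string or right after a ";"
  mutate(lineage = str_replace_all(lineage, "(^|;)[dpcofgs]__", "\\1")) %>%
  # one column per rank; short lineages are padded with NA on the right
  separate(lineage, into = ranks, sep = ";", remove = FALSE, fill = "right") %>%
  # unassigned ranks arrive as empty strings; convert them to NA
  mutate(across(all_of(ranks), ~ na_if(.x, "")))

print(parsed)
```

Converting empty ranks to `NA` matters later in the report, because the novel-bacteria question filters on `is.na()` for each rank.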
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>%
  select(-c(`Domain`, `Sequencing Status`, `Sequencing Center`)) %>%
  rename(`Genome Name` = `Genome Name / Sample Name`) %>%
  # drop re-annotated metagenomes and the WREF plot entries
  filter(str_detect(`Genome Name`, 're-annotation', negate = TRUE)) %>%
  filter(str_detect(`Genome Name`, 'WREF plot', negate = TRUE))
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>%
  # remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # use the first " - " to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>%
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date (1): collectionDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>%
  left_join(NEON_metagenomes, by = "Sample Name") %>%
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>%
  # rename Bin ID to label for use with the tree tip labels
  rename("label" = "Bin ID")
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")
# Make a vector with the tip labels followed by the internal node labels
node_vector_bac = c(tree_bac$tip.label,tree_bac$node.label)
# Search for your Phylum or Class to get the node
grep("Methylomirabilota", node_vector_bac, value = TRUE)
## [1] "'0.999:p__Methylomirabilota; c__Methylomirabilia'"
## [1] 2651
# Make a vector with the tip labels followed by the internal node labels
node_vector_arc = c(tree_arc$tip.label,tree_arc$node.label)
# Search for your Phylum or Class to get the node
grep("p__", node_vector_arc, value = TRUE)
## [1] "'1.0:p__Halobacteriota; c__Methanosarcinia; o__Methanosarcinales; f__Methanosarcinaceae; g__Methanosarcina'"
## [2] "'1.0:p__Thermoplasmatota; c__E2; o__JACPAO01; f__JAHFTW01'"
## [3] "'1.0:p__Methanobacteriota; c__Methanobacteria; o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobacterium_B; s__Methanobacterium_B sp003151535'"
## [4] "'1.0:p__Thermoproteota'"
## [1] 46 50 54 55
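The node lookups above rely on how `ape` numbers a tree: tips are numbered 1 to the number of tips, and internal nodes follow in the order of `node.label`, so a position in `c(tip.label, node.label)` is also a node number. The sketch below shows this on a made-up four-tip tree. It uses `ape::extract.clade()` rather than the `TreeTools` functions used in this report, and the labels `cladeA`, `cladeB` and `root` are hypothetical.

```r
library(ape)

# Toy Newick tree with labelled internal nodes (labels are made up)
toy_tree <- read.tree(text = "((t1,t2)cladeA,(t3,t4)cladeB)root;")

# Tip labels first, then internal node labels: the index equals the node number
toy_labels <- c(toy_tree$tip.label, toy_tree$node.label)

# Find the label of interest and convert it to a node number
hit <- grep("cladeB", toy_labels, value = TRUE)
node_id <- match(hit, toy_labels)

# Extract the clade rooted at that node
toy_subtree <- extract.clade(toy_tree, node_id)
print(sort(toy_subtree$tip.label))
```

Because node numbers belong to a particular tree object, it seems safest to run the lookup on the same object that is later passed to the extraction step, in case reordering the tree changes the numbering.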
# The tree must be in preorder before a subtree is extracted
tree_bac_preorder <- Preorder(tree_bac)
# NOTE: node 1712 differs from the index 2651 reported above for Methylomirabilota; confirm which node is intended
tree_Methylomirabilota <- Subtree(tree_bac_preorder, 1712)
NEON_MAGs_Methylomirabilota <- NEON_MAGs_metagenomes_chemistry %>%
  filter(Phylum == "Methylomirabilota")
# Keep only the bacterial MAGs from Onaqui
NEON_MAGs_metagenomes_chemistry_ONAQ <- NEON_MAGs_metagenomes_chemistry %>%
  filter(`Site ID.x` == "ONAQ") %>%
  filter(Domain == "Bacteria")
# ONAQ_MAGs_label is not defined in the rendered report; assumed here to be the labels of the Onaqui MAGs
ONAQ_MAGs_label <- NEON_MAGs_metagenomes_chemistry_ONAQ$label
# Drop every tip that is not an Onaqui MAG
tree_bac_ONAQ_MAGs <- drop.tip(tree_bac, tree_bac$tip.label[-match(ONAQ_MAGs_label, tree_bac$tip.label)])
ggtree(tree_bac_ONAQ_MAGs, layout="circular") %<+%
  NEON_MAGs_metagenomes_chemistry +
  geom_point(mapping=aes(color=Phylum))
This is a phylogenetic tree of the phyla present at Onaqui. It shows the same site-specific MAGs, in a different visualization. Most of the phyla sit close to each other on the tree, which suggests that they are closely related.
NEON_MAGs_bact_ind <- NEON_MAGs %>%
  filter(Domain == "Bacteria") %>%
  filter(`Assembly Type` == "Individual")
NEON_utah %>%
  ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Subplot)) +
  geom_bar() +
  coord_flip() +
  # attach the title with `+` so it is drawn on the plot instead of printed as a separate object
  labs(title = "MAG Count for each Subplot")
This graph shows the subplots at the site Onaqui, Great Basin, Utah, USA, along with the MAG count of each phylum present in each subplot. Actinobacteriota is found in the most subplots and is most frequent in subplot 004.
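The bar charts in this report order the bars with `fct_rev(fct_infreq(...))`. A minimal sketch of what that ordering does is given below. The data frame is made up, and the phylum names `PhylumA` to `PhylumC` are placeholders rather than values from the NEON data.

```r
library(ggplot2)
library(forcats)

# Made-up data: PhylumC is the most frequent, PhylumA the least
toy <- data.frame(
  Phylum = c("PhylumA", "PhylumB", "PhylumB", "PhylumC", "PhylumC", "PhylumC"),
  Subplot = c("001", "001", "002", "001", "002", "002")
)

# fct_infreq() orders levels from most to least frequent; fct_rev() reverses that,
# so after coord_flip() the most frequent phylum is drawn at the top
ordered_phylum <- fct_rev(fct_infreq(toy$Phylum))
print(levels(ordered_phylum))

toy_plot <- ggplot(toy, aes(x = fct_rev(fct_infreq(Phylum)), fill = Subplot)) +
  geom_bar() +
  coord_flip() +
  labs(title = "MAG count per phylum (toy data)", x = "Phylum", y = "Count")
```

Every layer, including `labs()`, is joined to the plot with `+`; a `labs()` call written as its own statement is printed as an object and never reaches the figure.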
Question: What is the taxonomic breakdown at Onaqui?
NEON_MAGs_bact_ind %>%
  ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Site)) +
  geom_bar(position = "dodge") +
  coord_flip()
The graph above shows the MAG count for each phylum, with the colors indicating how much of the count comes from each site. Methylomirabilota has a very small count across the USA, and the majority of it is in Texas, USA. At Onaqui, the majority of the count is in Actinobacteriota.
NEON_MAGs_bact_ind %>%
  ggplot(aes(x = Phylum)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip() +
  facet_wrap(vars(Site), scales = "free", ncol = 2)
This figure gives a closer look at each site shown in the previous graph. At our site, Onaqui, the count drops sharply from Actinobacteriota to the rest of the phyla.
This image shows the phylum counts at this specific location. Actinobacteriota has the highest count and Desulfobacteriota_B the lowest. The phylum Methylomirabilota is not found.
This image shows the count of each order within each phylum found at Onaqui. Actinobacteriota has the largest variety of orders, with the largest count coming from the Entotheonellales order.
NEON_utah %>%
  ggplot(aes(x = fct_rev(fct_infreq(Order)), fill = `Family`)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Family in each order", y = "Count", x = "Order")
This graph breaks down the families within each order present at Onaqui. The families WHSQ01, 70-9 and Solirubrobacteraceae have the highest overall counts, specifically in the orders CADDZG01 and Solirubrobacterales.
Question: Are there any novel bacteria found in Onaqui?
NEON_MAGs_bact_ind %>%
  # treat a MAG as novel if any taxonomic rank is unassigned
  filter(is.na(Class) | is.na(Order) | is.na(Genus) | is.na(Family) | is.na(Phylum) | is.na(Domain)) %>%
  ggplot(aes(x = fct_infreq(Site))) +
  geom_bar() +
  coord_flip()
This graph shows the count of novel bacteria found at each site. Our selected site, Great Basin, Onaqui, Utah, USA, has a count of ~28 novel bacteria.
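The filter above treats a MAG as novel when at least one taxonomic rank is missing. A compact way to express the same idea, shown here only as a sketch on made-up rows, is `dplyr::if_any()`. The site names, taxon names and the output column `novel_MAGs` are placeholders.

```r
library(tibble)
library(dplyr)

# Made-up MAG table: NA marks a rank that could not be assigned
toy <- tribble(
  ~Site,   ~Phylum, ~Class, ~Order, ~Family, ~Genus,
  "Site1", "P1",    "C1",   "O1",   "F1",    "G1",
  "Site1", "P1",    "C1",   "O1",   NA,      NA,
  "Site2", "P2",    NA,     NA,     NA,      NA
)

# Keep rows where any rank from Phylum to Genus is missing
novel <- toy %>%
  filter(if_any(Phylum:Genus, is.na))

# Count the novel MAGs per site
novel_counts <- novel %>%
  count(Site, name = "novel_MAGs")

print(novel_counts)
```

Under this definition, a MAG that is unclassified only at the genus level counts the same as one that is unclassified from the class level down, which is worth keeping in mind when reading the per-site counts.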
Question: What is the correlation of site/ecosystem subtype to soil temperature and soil pH?
This graph shows the soil temperature for each site. The far-left box plot shows our site, which has a soil temperature range of 11-16°.
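The code for the soil temperature plot is not shown in this report. The sketch below illustrates one way a per-site box plot and a temperature-pH correlation could be produced. The column names `siteID`, `soilTemp` and `soilInWaterpH` follow the soil chemistry field descriptions later in this report; the site codes and all values are made up for illustration and are not NEON measurements.

```r
library(ggplot2)

# Made-up values for illustration only
toy_chem <- data.frame(
  siteID = rep(c("SITE_A", "SITE_B"), each = 3),
  soilTemp = c(11, 13, 16, 20, 22, 21),
  soilInWaterpH = c(7.9, 8.1, 8.0, 5.5, 5.9, 5.7)
)

# One box per site, showing the spread of soil temperature
temp_plot <- ggplot(toy_chem, aes(x = siteID, y = soilTemp)) +
  geom_boxplot() +
  labs(x = "Site", y = "Soil temperature")

# Pearson correlation between soil temperature and soil pH across all samples
temp_ph_cor <- cor(toy_chem$soilTemp, toy_chem$soilInWaterpH, use = "complete.obs")
print(temp_ph_cor)
```

A box plot answers how temperature differs between sites, while `cor()` gives a single number for how two measurements move together, so the two views address different parts of the question.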
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_edArchaea.csv") %>%
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>%
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                                     TRUE ~ "Individual")) %>%
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
  # split the taxonomy lineage into one column per rank
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>%
  # remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # use the first " - " to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>%
  rename(`Genome Name` = `Genome Name / Sample Name`) %>%
  # drop re-annotated metagenomes and the WREF plot entries
  filter(str_detect(`Genome Name`, 're-annotation', negate = TRUE)) %>%
  filter(str_detect(`Genome Name`, 'WREF plot', negate = TRUE))
NEON_metagenomes <- NEON_metagenomes %>%
  # remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # use the first " - " to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
  # remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>%
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
# read the field descriptions, then render them as a table
NEON_chemistry_description <- read_tsv("data/NEON/neon_soilChem1_metadata_descriptions.tsv")
kable(NEON_chemistry_description)
## Rows: 23 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): fieldName, description, dataType, units
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| fieldName | description | dataType | units |
|---|---|---|---|
| siteID | NEON site code | string | NA |
| plotID | Plot identifier (NEON site code_XXX) | string | NA |
| sampleID | Identifier for sample | string | NA |
| horizon | Organic or mineral soil | string | NA |
| genomicsSampleID | Identifier for a genomics sample | string | NA |
| d15N | Measure of the ratio of 15N:14N in a sample, relative to atmospheric N2 | real | permill |
| organicd13C | Measure of the ratio of 13C:12C in soil organic carbon, relative to Vienna Pee Dee Belemnite | real | permill |
| nitrogenPercent | Percent nitrogen in a sample on a dry weight basis | real | percent |
| organicCPercent | Percent organic carbon in a sample on a dry weight basis | real | percent |
| CNratio | Ratio of carbon to nitrogen concentration in a sample on a dry weight basis | real | NA |
| nlcdClass | National Land Cover Database Vegetation Type Name | string | NA |
| subplotID | Identifier for the NEON subplot | string | NA |
| coreCoordinateX | x location of the soil core relative to the SW corner | real | meter |
| coreCoordinateY | y location of the soil core relative to the SW corner | real | meter |
| decimalLatitude | The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area | real | decimalDegree |
| decimalLongitude | The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area | real | decimalDegree |
| elevation | Elevation (in meters) above sea level | real | meter |
| sampleTiming | Timing of the sampling event with regard to the field season | string | NA |
| soilTemp | In-situ temperature of soil at approximately 10 cm depth | real | degree |
| sampleTopDepth | Depth below the soil surface of the top of a soil sample | real | centimeter |
| sampleBottomDepth | Depth below the soil surface of the bottom of a soil sample | real | centimeter |
| soilInWaterpH | pH value of soil measured in water solution | real | pH |
| soilInCaClpH | pH value of soil measured in calcium chloride solution | real | pH |
## # A tibble: 6 × 5
## `Sample Name` `Site ID.x` `Ecosystem Subtype.x` `Site ID.y`
## <chr> <chr> <chr> <chr>
## 1 ONAQ_004-M-20210525 ONAQ Shrubland ONAQ
## 2 ONAQ_010-M-20210526 ONAQ Shrubland ONAQ
## 3 ONAQ_008-M-20210524 ONAQ Shrubland ONAQ
## 4 ONAQ_002-M-20210524 ONAQ Shrubland ONAQ
## 5 ONAQ_005-M-20210527 ONAQ Shrubland ONAQ
## 6 ONAQ_003-M-20210527 ONAQ Shrubland ONAQ
## # ℹ 1 more variable: `Ecosystem Subtype.y` <chr>
NEON_metagenomes_site %>%
  left_join(NEON_chemistry_site, by = c("Sample Name" = "genomicsSampleID"))
## # A tibble: 6 × 5
## `Sample Name` `Site ID` `Ecosystem Subtype` siteID nlcdClass
## <chr> <chr> <chr> <chr> <chr>
## 1 ONAQ_004-M-20210525 ONAQ Shrubland <NA> <NA>
## 2 ONAQ_010-M-20210526 ONAQ Shrubland <NA> <NA>
## 3 ONAQ_008-M-20210524 ONAQ Shrubland <NA> <NA>
## 4 ONAQ_002-M-20210524 ONAQ Shrubland <NA> <NA>
## 5 ONAQ_005-M-20210527 ONAQ Shrubland <NA> <NA>
## 6 ONAQ_003-M-20210527 ONAQ Shrubland <NA> <NA>
## # A tibble: 6 × 5
## `Sample Name` `Site ID` `Ecosystem Subtype` genomicsSampleID nlcdClass
## <chr> <chr> <chr> <chr> <chr>
## 1 ONAQ_004-M-20210525 ONAQ Shrubland <NA> <NA>
## 2 ONAQ_010-M-20210526 ONAQ Shrubland <NA> <NA>
## 3 ONAQ_008-M-20210524 ONAQ Shrubland <NA> <NA>
## 4 ONAQ_002-M-20210524 ONAQ Shrubland <NA> <NA>
## 5 ONAQ_005-M-20210527 ONAQ Shrubland <NA> <NA>
## 6 ONAQ_003-M-20210527 ONAQ Shrubland <NA> <NA>
Table_9 %>%
  ggplot(aes(x = fct_infreq(`Ecosystem Subtype`), y = soilTemp, color = Order)) +
  geom_point() +
  coord_flip()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
This graph shows which orders are present in each ecosystem subtype and the soil temperature at which they were sampled. At the bottom, the grasslands contain Rokubacteriales at several different soil temperatures.
Table_9 %>%
  ggplot(aes(x = fct_infreq(`nlcdClass`), y = soilInCaClpH, color = Family)) +
  geom_point() +
  coord_flip()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
This graph shows the land cover class (nlcdClass) of each sample along with the soil pH, measured in calcium chloride solution, for the families found there. The CSP1-6 family occurs in several land cover classes and across a range of soil pH values. However, the 2-02-FULL-66-22 family is found only in the emergent herbaceous wetlands, at a soil pH of ~7.
Question: What are all the taxa at Onaqui?
This graph shows all of the taxa that are present at this site.
The site observed is the terrestrial site Great Basin, Onaqui, Utah, USA. With the data collected through NEON, we were able to determine the taxonomic breakdown of this location as well as its soil temperature and pH. The taxonomic breakdown includes the phylum count with the orders in each phylum, the family count, and the MAG count by class. The data show that the most abundant phylum at Onaqui is Actinobacteriota. This phylum also contains several different orders, reaching a total count of 35. The subplot graph displays the count of the orders present in each subplot of the site. Subplot 004 has the highest MAG count, ~30, and contains almost all of the 14 classes present across the entire ONAQ site. The family count graph shows that WHSQ-01 has the highest count at this site, 5, followed by 70-9 with 4. From the NEON data we were also able to visualize the soil temperature and pH at this site. The soil temperature at Onaqui ranges from ~12° to ~16°. The median soil temperature lies close to the 3rd quartile, which is just below 15°. Onaqui's soil temperatures overlap with those of many other sites studied through NEON. The soil pH graph shows which families are present at each soil pH. The family CSP1-6 is found at soil pH values from 5 to 7. The family 2-02-FULL-66-22 is present only at a soil pH of ~7.
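The quartile comparison described above can be checked numerically. The following is a minimal base-R sketch, assuming a hypothetical vector of soil temperatures (`soil_temp` is made up for illustration and is not the NEON data). It computes the quartiles and tests whether the median sits closer to the 3rd quartile than to the 1st.

```r
# Hypothetical soil temperatures in degrees (illustrative values, not NEON data)
soil_temp <- c(12.1, 13.0, 14.6, 14.8, 14.9, 15.2, 16.0)

# Range of the observations
temp_range <- range(soil_temp)

# 1st quartile, median and 3rd quartile
q <- quantile(soil_temp, probs = c(0.25, 0.5, 0.75))
print(q)

# TRUE when the median is nearer to the 3rd quartile than to the 1st
median_closer_to_q3 <- (q[["75%"]] - q[["50%"]]) < (q[["50%"]] - q[["25%"]])
print(median_closer_to_q3)
```

With real data, `soil_temp` would be replaced by the `soilTemp` column filtered to the site of interest.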
Question: Where in the US is the phylum Methylomirabilota found? Is Methylomirabilota found at our selected site?
NEON_MAGS_table <- NEON_MAGs_bact_ind %>%
  filter(Phylum == 'Methylomirabilota')
datatable(
  NEON_MAGS_table %>%
    count(Site, sort = TRUE))
This table shows the 5 sites at which Methylomirabilota is found, along with the MAG count at each. The highest count of our phylum is at National Grasslands LBJ, Texas, USA. Additionally, Methylomirabilota is not found among the individual assemblies at our selected site in Utah.
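The per-site tally can be illustrated without the NEON files. This is a minimal base-R sketch, assuming a hypothetical vector of site labels with one entry per MAG (`mag_sites` and the helper `has_mags` are made up for illustration). It mirrors what `count(Site, sort = TRUE)` does and shows how to check whether a given site appears at all.

```r
# Hypothetical site labels, one entry per MAG (illustrative, not NEON data)
mag_sites <- c("Site A", "Site B", "Site A", "Site C", "Site A", "Site B")

# Tally MAGs per site and order from most to least frequent,
# a base-R analogue of count(Site, sort = TRUE)
site_counts <- sort(table(mag_sites), decreasing = TRUE)
print(site_counts)

# Hypothetical helper: a site missing from the tally has no MAGs for the phylum
has_mags <- function(site) site %in% names(site_counts)
print(has_mags("Site D"))
```

A site that returns `FALSE` here is simply absent from the filtered table, which is how a selected site can be confirmed to have no MAGs for the phylum.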
NEON_MAGS_table %>%
  ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = `Genus`)) + geom_bar() +
  coord_flip() +
  labs(title = "Genus at each site", y = "Count", x = "Site")
This graph shows the genera present at each site. Genus is one of the finest taxonomic ranks, just above species. The graph shows which genera from the Methylomirabilota phylum are found at which location. The site with the most genera is in Texas, USA, where the genus AR12 is the most abundant.
Question: Where is each order found?
This shows the orders from the phylum Methylomirabilota found at each site. Four of the sites contain Rokubacteriales, while one site contains Methylomirabilales.
Question: What are the individual and co-assemblies for Methylomirabilota?
This is the co-assembly of Methylomirabilota at our site.
This is the individual assembly of Methylomirabilota at our site.
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("treeio")
BiocManager::install("ggtreeExtra")
library(tidyverse)
library(knitr)
library(ggtree)
library(TDbook) #A Companion Package for the Book "Data Integration, Manipulation and Visualization of Phylogenetic Trees" by Guangchuang Yu (2022, ISBN:9781032233574).
library(ggimage)
library(rphylopic)
library(treeio)
library(tidytree)
library(ape)
library(TreeTools)
library(phytools)
library(ggnewscale)
library(ggtreeExtra)
library(ggstar)
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_2024_4_21.csv") %>%
# remove columns that are not needed for data analysis
select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`, `Bin Lineage`)) %>%
# create a new column with the Assembly Type
mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
TRUE ~ "Individual")) %>%
mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "d__", "") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "p__", "") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "c__", "") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "o__", "") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "f__", "") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "g__", "") %>%
mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "s__", "") %>%
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), ";", remove = FALSE) %>%
mutate_at("Domain", na_if,"") %>%
mutate_at("Phylum", na_if,"") %>%
mutate_at("Class", na_if,"") %>%
mutate_at("Order", na_if,"") %>%
mutate_at("Family", na_if,"") %>%
mutate_at("Genus", na_if,"") %>%
mutate_at("Species", na_if,"") %>%
# Get rid of the common string "Terrestrial soil microbial communities from "
mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
# Use the first ` - ` to split the column in two
separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
# Get rid of the common string "-comp-1"
mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
# separate the Sample Name into Site ID and plot info
separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
# separate the plot info into 3 columns
separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date (1): Date Added
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
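The name-splitting steps in the pipeline above can be traced on a single value. This is a minimal base-R sketch, assuming a genome name assembled from the strings that appear in this report (the full example string and the helper `parse_genome_name` are illustrative assumptions, not taken from the data file). It also shows why a row whose name is only "NEON combined assembly" ends up with `NA` pieces, which is a likely source of the "Expected 2 pieces" warning.

```r
# Hypothetical helper that mirrors the separate() steps using base R
parse_genome_name <- function(genome_name) {
  # Drop the common prefix
  trimmed <- sub("Terrestrial soil microbial communities from ", "",
                 genome_name, fixed = TRUE)
  # Split into site and sample name at " - "
  parts <- strsplit(trimmed, " - ", fixed = TRUE)[[1]]
  # Drop the "-comp-1" suffix from the sample name (NA stays NA)
  sample_name <- sub("-comp-1", "", parts[2], fixed = TRUE)
  # Split the sample name into site ID and plot info at "_"
  id_and_plot <- strsplit(sample_name, "_", fixed = TRUE)[[1]]
  # Split the plot info into subplot, layer and date at "-"
  plot_info <- strsplit(id_and_plot[2], "-", fixed = TRUE)[[1]]
  list(site = parts[1], sample_name = sample_name, site_id = id_and_plot[1],
       subplot = plot_info[1], layer = plot_info[2], date = plot_info[3])
}

# Illustrative individual-assembly name built from strings used in this report
parsed <- parse_genome_name(
  "Terrestrial soil microbial communities from Great Basin, Onaqui, Utah, USA - ONAQ_004-M-20210525-comp-1"
)
str(parsed)

# A combined-assembly name has no " - ", so the sample pieces are missing
combined <- parse_genome_name("NEON combined assembly")
str(combined)
```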
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>%
select(-c(`Domain`, `Sequencing Status`, `Sequencing Center`)) %>%
rename(`Genome Name` = `Genome Name / Sample Name`) %>%
filter(str_detect(`Genome Name`, 're-annotation', negate = TRUE)) %>%
filter(str_detect(`Genome Name`, 'WREF plot', negate = TRUE))
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>%
# Get rid of the common string "Terrestrial soil microbial communities from "
mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
# Use the first ` - ` to split the column in two
separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
# Get rid of the common string "-comp-1"
mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
# separate the Sample Name into Site ID and plot info
separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>%
# separate the plot info into 3 columns
separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>%
# remove -COMP from genomicsSampleID
mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date (1): collectionDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>%
left_join(NEON_metagenomes, by = "Sample Name") %>%
left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>%
rename("label" = "Bin ID")
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")
# Make a vector of the tip labels followed by the internal node labels
node_vector_bac <- c(tree_bac$tip.label, tree_bac$node.label)
# Search for your Phylum or Class to get the node
grep("Methylomirabilota", node_vector_bac, value = TRUE)
## [1] "'0.999:p__Methylomirabilota; c__Methylomirabilia'"
# Get the position of that label in the vector, which is the node number
grep("Methylomirabilota", node_vector_bac)
## [1] 2651
# Make a vector of the tip labels followed by the internal node labels
node_vector_arc <- c(tree_arc$tip.label, tree_arc$node.label)
# Search for your Phylum or Class to get the node
grep("p__", node_vector_arc, value = TRUE)
## [1] "'1.0:p__Halobacteriota; c__Methanosarcinia; o__Methanosarcinales; f__Methanosarcinaceae; g__Methanosarcina'"
## [2] "'1.0:p__Thermoplasmatota; c__E2; o__JACPAO01; f__JAHFTW01'"
## [3] "'1.0:p__Methanobacteriota; c__Methanobacteria; o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobacterium_B; s__Methanobacterium_B sp003151535'"
## [4] "'1.0:p__Thermoproteota'"
# Get the positions of those labels in the vector, which are the node numbers
grep("p__", node_vector_arc)
## [1] 46 50 54 55
# The tree needs to be in preorder before a subtree can be extracted
tree_bac_preorder <- Preorder(tree_bac)
# Use the node number reported for Methylomirabilota by the search above
tree_Methylomirabilota <- Subtree(tree_bac_preorder, 2651)
NEON_MAGs_Methylomirabilota <- NEON_MAGs_metagenomes_chemistry %>%
filter(Phylum == "Methylomirabilota")
ggtree(tree_Methylomirabilota, layout="circular") %<+%
NEON_MAGs_metagenomes_chemistry +
geom_tiplab(size=2, hjust=-.1) +
xlim(0,20) +
geom_point(mapping=aes(color=Class, shape = `Assembly Type`))
## Warning: Removed 46 rows containing missing values or values outside the scale range
## (`geom_point()`).
This phylogenetic tree shows the classes within the phylum Methylomirabilota and whether each MAG was co-assembled or individually assembled. It is another way to look at the data that was also presented in the Sankey plots.
NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>%
rename("AssemblyType" = "Assembly Type") %>%
rename("BinCompleteness" = "Bin Completeness") %>%
rename("BinContamination" = "Bin Contamination") %>%
rename("TotalNumberofBases" = "Total Number of Bases") %>%
rename("EcosystemSubtype" = "Ecosystem Subtype")
ggtree(tree_Methylomirabilota, layout="circular", branch.length="none") %<+%
NEON_MAGs_metagenomes_chemistry +
geom_point2(mapping=aes(color=`Ecosystem Subtype`, size=`Total Number of Bases`)) +
new_scale_fill() +
geom_fruit(
data=NEON_MAGs_metagenomes_chemistry_noblank,
geom=geom_tile,
mapping=aes(y=label, x=1, fill= AssemblyType),
offset=0.08, # The distance between external layers, default is 0.03 times of x range of tree.
pwidth=0.25 # width of the external layer, default is 0.2 times of x range of tree.
) +
new_scale_fill() +
geom_fruit(
data=NEON_MAGs_metagenomes_chemistry_noblank,
geom=geom_col,
mapping=aes(y=label, x=TotalNumberofBases),
pwidth=0.4,
axis.params=list(
axis="x", # add axis text of the layer.
text.angle=-45, # the text angle of axis.
hjust=0 # adjust the horizontal position of text of axis.
),
grid.params=list() # add the grid line of the external bar plot.
) +
theme(#legend.position=c(0.96, 0.5), # the position of legend.
legend.background=element_rect(fill=NA), # the background of legend.
legend.title=element_text(size=7), # the title size of legend.
legend.text=element_text(size=6), # the text size of legend.
legend.spacing.y = unit(0.02, "cm") # the distance of legends (y orientation).
)
## Warning: Removed 46 rows containing missing values or values outside the scale range
## (`geom_point_g_gtree()`).
This figure extends the previous phylogenetic tree. It adds the ecosystem subtype, the assembly type (individual or co-assembly), and the total number of bases.
Question: What is the soil temperature range for this phylum?
This shows the soil temperature at which each phylum is found. Methylomirabilota has a smaller range of temperatures than the other phyla, from ~14° to ~24°.
Question: What are the ecosystem subtypes for Methylomirabilota?
NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>%
rename("AssemblyType" = "Assembly Type") %>%
rename("BinCompleteness" = "Bin Completeness") %>%
rename("BinContamination" = "Bin Contamination") %>%
rename("TotalNumberofBases" = "Total Number of Bases") %>%
rename("EcosystemSubtype" = "Ecosystem Subtype")
ggtree(tree_Methylomirabilota) %<+%
NEON_MAGs_metagenomes_chemistry +
geom_tippoint(aes(colour=`Ecosystem Subtype`)) +
# For unknown reasons the following does not like blank spaces in the names
geom_facet(panel = "Bin Completeness", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_point,
mapping=aes(x = BinCompleteness)) +
geom_facet(panel = "Bin Contamination", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_col,
aes(x = BinContamination), orientation = 'y', width = .6) +
theme_tree2(legend.position=c(.1, .7))
This image combines 3 panels. The far left panel shows the tree, with tips colored by the ecosystem subtype in which Methylomirabilota is found. The other two panels show the bin completeness and bin contamination of each MAG. Both measures are based on the marker genes detected in each bin.
As mentioned before, Methylomirabilota, also known as NC10, is a bacterial phylum. Within the NEON database, there are 23 MAGs that belong to the Methylomirabilota phylum. Of these 23, 16 came from individual assemblies while 7 came from the NEON combined assembly. This lab will mostly focus on the 16 MAGs from the individual assemblies.
As can be seen in the individual assemblies above, all 16 MAGs belong to the class Methylomirabilia. Further down the taxonomy, however, 15 of the 16 belong to the order Rokubacteriales while 1 belongs to the order Methylomirabilales.
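The class and order assignments discussed above come from splitting each GTDB-Tk lineage string into ranks. This is a minimal base-R sketch of that step, assuming a hypothetical lineage string (the exact string is made up for illustration, built from taxon names used in this report). Empty ranks become `NA`, matching the `na_if` calls in the data preparation.

```r
# Hypothetical GTDB-Tk lineage string (illustrative, not copied from the data)
lineage <- "d__Bacteria;p__Methylomirabilota;c__Methylomirabilia;o__Rokubacteriales;f__;g__;s__"

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Split on ";" and keep exactly one field per rank (missing fields become NA)
fields <- strsplit(lineage, ";", fixed = TRUE)[[1]][seq_along(ranks)]

# Strip the rank prefix such as "d__" or "p__" from the start of each field
taxonomy <- sub("^[a-z]__", "", fields)

# Treat empty ranks as missing
taxonomy[!is.na(taxonomy) & taxonomy == ""] <- NA
names(taxonomy) <- ranks
print(taxonomy)
```

Counting the `Order` values across all MAGs of the phylum then gives the kind of breakdown reported above.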
Beyond the taxonomic breakdown, the Methylomirabilota phylum was found at 5 different NEON sites, with a majority of the MAGs at the National Grasslands LBJ and Konza Prairie BioStation sites. Six genera are found across those sites. Both orders, Methylomirabilales and Rokubacteriales, were also found, but Rokubacteriales occurred at 4 sites while Methylomirabilales occurred at only one. The soil temperature at which this phylum was found ranged from 14° to 24°. This range is much smaller than the temperature ranges of the other phyla.
The results obtained from the NEON data set, together with the visuals above, were used to draw conclusions about the site and phylum we highlighted. Among the individual assemblies at the site, we found no evidence of our phylum. However, the co-assembly shows a small presence of Methylomirabilota. We were also able to learn which phyla are prominent in the Onaqui ecosystem. These include Actinobacteriota, Chloroflexota and Proteobacteria. We then broke this down further into the orders and families found there. The average soil temperature at this site is much lower than the average soil temperature at which Methylomirabilota is found, which could be one of the reasons it is not found there. By understanding Methylomirabilota, its characteristics and functions, and the environments it is found in, we can better understand its importance to the ecosystems in which it is most abundant. Additionally, by gathering data from the site, we can analyze how the bacterial phyla found at Onaqui shape the ecosystem as it is today.
We thank NEON for creating useful, clear and important data that we can analyze further. Understanding these phyla and sites, specifically Onaqui and Methylomirabilota, helps us better grasp the ecosystems in the USA. Working with the data collected by NEON, we were able to produce graphs and tables that display the data visually. With these, we were able to draw conclusions about what is present, and perhaps even why that is. Much more data is available within NEON that can help us further understand the different ecosystems in the US. In the future, with our data and findings, we can continue to learn how our ecosystems are changing and how we can adapt to and protect them so that they remain healthy and stable.
(He et al. 2016) (“Onaqui NEON NSF NEON Open Data to Understand Our Ecosystems” n.d.) (“DOE Joint Genome Institute: A DOE Office of Science User Facility of Lawrence Berkeley National Laboratory” n.d.) (Clum et al. n.d.) (Baxter 2018) (Holthuijzen and Veblen 2015)